


LSI DOCKET NO. 02-6471 

APPLICATION FOR 
LETTERS PATENT OF THE UNITED STATES 

CERTIFICATE OF MAILING BY "EXPRESS MAIL"
"EXPRESS MAIL" Mailing Label Number EV333421945US 
Date of Deposit: [illegible], 2003

I HEREBY CERTIFY THAT THIS CORRESPONDENCE, CONSISTING OF 31 
PAGES OF SPECIFICATION AND 11 PAGES OF DRAWINGS, IS BEING 
DEPOSITED WITH THE UNITED STATES POSTAL SERVICE "EXPRESS MAIL 
POST OFFICE TO ADDRESSEE" SERVICE UNDER 37 CFR 1.10 ON THE DATE 
INDICATED ABOVE AND IS ADDRESSED TO: MAIL STOP PATENT 
APPLICATION, COMMISSIONER FOR PATENTS, P.O. BOX 1450, ALEXANDRIA, 
VA 22313-1450. 

by: [signature]

Amefia C. Nearing

SPECIFICATION 

To all whom it may concern: 

Be It Known, That We, Paresh Chatterjee, a citizen of the United States of America, 
residing at 660 Monticello Terrace, Fremont, California 94539 and Basavaraj 
Gurupadappa Hallyal, a citizen of India, residing at 39600 Fremont Boulevard, Apartment 
#113, Fremont, California 94538 and Senthil Murugan Thangaraj, a citizen of India, 
residing at 40934, Inglewood Common, Fremont, California 94538 and Narasimhulu 
Dharanikumar Kotte, a citizen of India, residing at 3300, Wolcott Common, Apartment 
#218, Fremont, California 94538 and Ramya Subramanian, a citizen of India, residing at 
39663 Leslie Street, Apartment #305, Fremont, California 94538 have invented certain new 
and useful improvements in "Method, Apparatus and Program for Migrating Between 
Striped Storage and Parity Striped Storage", of which We declare the following to be a full, 
clear and exact description: 
BACKGROUND OF THE INVENTION

1. Technical Field:

The present invention is directed generally toward storage systems and, more particularly,
to a method, apparatus, and program for migrating between striped storage and parity striped
storage.



2. Description of the Related Art:

Redundant Array of Independent Disks (RAID) is a disk subsystem that is used to
increase performance and/or provide fault tolerance. RAID is a set of two or more ordinary hard
disks and a specialized disk controller that contains RAID functionality. RAID can also be
implemented via software only, but with less performance, especially when rebuilding data after
a failure. RAID improves performance by disk striping, which interleaves bytes or groups of
bytes across multiple drives, so more than one disk is reading and writing simultaneously. Fault
tolerance is achieved by mirroring or parity.

There are several levels of RAID that are common in current computer systems. RAID
level 0 is disk striping only, which interleaves data across multiple disks for better performance.
RAID level 1 uses disk mirroring, which provides 100% duplication of data. It offers the highest
reliability, but doubles storage cost. In RAID level 3, data are striped across three or more
drives. This level is used to achieve the highest data transfer, because all drives operate in
parallel. Parity bits are stored on separate, dedicated drives. RAID level 5 is perhaps the most
widely used. Data are striped across three or more drives for performance, and parity bits are
used for fault tolerance. The parity bits from all drives but one are stored on a remaining drive,
which alternates among the three or more drives.
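The parity-based fault tolerance described above can be illustrated with a minimal sketch. The helper name xor_blocks and the sample stripe values are assumptions made for illustration only; they do not appear in the specification.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data stripes of one row (values are made up for illustration).
data = [b"\x01\x02", b"\x10\x20", b"\xaa\x55"]
parity = xor_blocks(data)

# If the drive holding data[1] fails, XOR of the surviving stripes and the
# parity stripe reproduces the lost stripe.
rebuilt = xor_blocks([data[0], data[2], parity])
```

Because XOR is its own inverse, any single missing stripe of a row can be recovered from the remaining stripes and the parity.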

Day by day the need for data storage is increasing. This demands the addition of more
drives, which leads to migration of the existing volume to a new volume. Migration is
conventionally done in two ways: Online Capacity Expansion (OCE) and RAID Level
Migration (RLM). OCE is the addition of RAID capacity onto new disk drives without
power-down or reboot. The existing volumes on the array remain accessible during the
expansion process. RLM allows the user to migrate a RAID volume from one RAID level to
another without power-down or reboot. The volumes remain accessible during the migration
process.

The need for RLM arises from the fact that customers are demanding reliable ways to protect
large volumes of data stored across an increasing number of disk drives. RAID technology
allows a group of disk drives to be "tied" together to act as a single logical disk drive from the
operating system perspective, providing increased performance and fault tolerance. For example,
one can add a single drive to four previously existing drives configured as RAID 0 and
reconstruct these drives to RAID 5 with no data being lost or corrupted during the migration
process. With RAID 1, creating large volumes is expensive because of the disk drives consumed
by mirroring, so RAID 5 is generally preferred.

Therefore, it would be advantageous to provide an improved and more efficient
mechanism for migrating between striped storage and redundant parity striped storage.
SUMMARY OF THE INVENTION

The present invention provides an efficient mechanism for migration between striped storage
and redundant parity striped storage. When a disk is added to a disk array, the mechanism of the
present invention migrates from RAID 0 to RAID 5. For each row, the mechanism calculates a
parity stripe position, calculates a parity for the row and, if the parity position is the new drive,
writes the parity to the parity stripe position. If, however, the parity position is not the new drive,
the mechanism writes the data from the parity position to the new drive and writes the parity to the
parity stripe position. If a drive fails, the mechanism of the present invention migrates back from
RAID 5 to RAID 0. For each row, the mechanism calculates a parity stripe position and, if the
parity stripe position is the failed drive, writes the failed drive data to the parity position. If,
however, the parity position is not the failed drive, the mechanism reads the data from remaining
drives, XORs the data stripes to get failed drive data, and writes the failed drive data to the parity
position. If a read or write is received for the failed drive, the mechanism of the present invention
simply redirects the read or write to the parity position.
BRIEF DESCRIPTION OF THE DRAWINGS

The novel features believed characteristic of the invention are set forth in the appended
claims. The invention itself, however, as well as a preferred mode of use, further objects and
advantages thereof, will best be understood by reference to the following detailed description of an
illustrative embodiment when read in conjunction with the accompanying drawings, wherein:

Figure 1 is a pictorial representation of a data processing system in which the present
invention may be implemented in accordance with a preferred embodiment of the present
invention;

Figure 2 is a block diagram of a data processing system in which the present
invention may be implemented;

Figure 3 is a block diagram illustrating a prior art RAID migration algorithm;

Figure 4 is a block diagram illustrating a RAID migration algorithm in accordance with a
preferred embodiment of the present invention;

Figure 5 is a flowchart illustrating the operation of RAID migration in accordance with a
preferred embodiment of the present invention;

Figures 6A and 6B are block diagrams illustrating reconstruction of a degraded RAID 5
volume to RAID 0 in accordance with a preferred embodiment of the present invention;

Figure 7 is a flowchart illustrating the operation of a process of reconstructing a degraded
RAID 5 volume to RAID 0 in accordance with a preferred embodiment of the present invention;

Figure 8 is a block diagram illustrating an algorithm for reconstructing modified RAID 0
to RAID 5 in accordance with a preferred embodiment of the present invention;

Figure 9 is a flowchart illustrating the operation of a reconstruction algorithm in
accordance with a preferred embodiment of the present invention;

Figures 10A, 10B, and 11-13 are block diagrams illustrating access operations to
degraded RAID 5 according to the prior art;

Figures 14A and 14B are block diagrams illustrating access operations to degraded RAID
5 in accordance with a preferred embodiment of the present invention; and

Figure 15 is a flowchart illustrating the operation of a read/write access in accordance
with a preferred embodiment of the present invention.
DETAILED DESCRIPTION

The description of the preferred embodiment of the present invention has been presented
for purposes of illustration and description, but is not intended to be exhaustive or limited to the
invention in the form disclosed. Many modifications and variations will be apparent to those of
ordinary skill in the art. The embodiment was chosen and described in order to best explain the
principles of the invention and the practical application, and to enable others of ordinary skill in
the art to understand the invention for various embodiments with various modifications as are
suited to the particular use contemplated.

With reference now to the figures and in particular with reference to Figure 1, a pictorial
representation of a data processing system in which the present invention may be implemented is
depicted in accordance with a preferred embodiment of the present invention. A computer 102
is depicted which is connected to disk array 120 via storage adapter 110. Computer 102 may be
implemented using any suitable computer, such as an IBM eServer computer or IntelliStation
computer, which are products of International Business Machines Corporation, located in
Armonk, New York.

In the depicted example, disk array 120 includes disk 0, disk 1, disk 3, and disk 4.
However, more or fewer disks may be included in the disk array within the scope of the present
invention. In accordance with a preferred embodiment of the present invention, a disk may be
added to the disk array, such as disk X in Figure 1. The RAID system must then migrate data to
the new disk to use the disk within the array. For example, and in accordance with the exemplary
aspects of the present invention, the original disk array 120 may be a RAID level 0 array. In
other words, the disk array may use disk striping only, which interleaves data across multiple
disks for better performance. When disk X is added to the array, the RAID system, including
computer 102 and storage adapter 110, may migrate from RAID level 0 to RAID level 5 to stripe
data across the drives for performance and to use parity bits for fault tolerance.

With reference now to Figure 2, a block diagram of a data processing system is shown in
which the present invention may be implemented. Data processing system 200 is an example of a
computer, such as computer 102 in Figure 1, in which storage adapter 210 implements the present
invention. Data processing system 200 employs a peripheral component interconnect (PCI) local
bus architecture. Although the depicted example employs a PCI bus, other bus architectures such
as Accelerated Graphics Port (AGP) and Industry Standard Architecture (ISA) may be used.
Processor 202 and main memory 204 are connected to PCI local bus 206 through PCI bridge 208.
PCI bridge 208 also may include an integrated memory controller and cache memory for processor
202. Additional connections to PCI local bus 206 may be made through direct component
interconnection or through add-in boards.

In the depicted example, storage adapter 210, local area network (LAN) adapter 212, and
expansion bus interface 214 are connected to PCI local bus 206 by direct component connection. In
contrast, audio adapter 216, graphics adapter 218, and audio/video adapter 219 are connected to
PCI local bus 206 by add-in boards inserted into expansion slots. Expansion bus interface 214
provides a connection for a keyboard and mouse adapter 220, modem 222, and additional memory
224. Storage adapter 210 provides a connection for hard disk drives, such as disk array 120 in
Figure 1. Typical PCI local bus implementations will support three or four PCI expansion slots or
add-in connectors.

An operating system runs on processor 202 and is used to coordinate and provide control of
various components within data processing system 200 in Figure 2. The operating system may be
a commercially available operating system such as AIX, which is available from IBM Corporation.
An object oriented programming system such as Java may run in conjunction with the operating
system and provides calls to the operating system from Java programs or applications executing on
data processing system 200. "Java" is a trademark of Sun Microsystems, Inc. Instructions for the
operating system, the object-oriented programming system, and applications or programs are
located on storage devices, such as hard disk drives, and may be loaded into main memory 204 for
execution by processor 202.

Those of ordinary skill in the art will appreciate that the hardware in Figure 2 may vary
depending on the implementation. Other internal hardware or peripheral devices, such as flash
read-only memory (ROM), equivalent nonvolatile memory, or optical disk drives and the like,
may be used in addition to or in place of the hardware depicted in Figure 2. Also, the processes
of the present invention may be applied to a multiprocessor data processing system.
1. Terminology

1.1. Migration

Migration is a process of converting a RAID volume from one RAID level to another or
expanding the capacity of the existing volume.

1.2. Temporary Migration Stripe

The Temporary Migration Stripe is an extra stripe at the end of each disk, with the same size
as the RAID volume stripe. In the prior art, this is required to store the data read from each
disk before transferring it to the new RAID volume.

1.3. Temporary Migration Row

The Temporary Migration Stripes of all disks in the current RAID volume, taken together,
form a single Temporary Migration Row.

2. Prior Art 

The existing implementation of RLM is complex and time consuming. A brief
explanation of the existing algorithm is presented by way of example. Consider a RAID 0 volume
with M disks. By adding one disk to the existing M disks of the RAID 0 volume, a
RAID 5 volume of N disks can be reconstructed, where N = M + 1.

2.1. Prior Art Algorithm 

Figure 3 is a block diagram illustrating a prior art RAID migration process. 

2.1.1. Migrating RAID 0 to RAID 5
Assumptions:

• Let 'M' be the maximum number of disks present in RAID 0.

• Number of disks that will be added to RAID 0 to migrate to RAID 5 is 1.

• 'N' is the number of disks present in RAID 5.

• Temporary Migration Stripe is available for each disk of RAID 0.

• Newly added disk number is X.

• Initialize the non-volatile area variable migratedRow to 0. The non-volatile area
variable may be an area in the disk itself or non-volatile read/write memory (NVRAM),
for example.

• 'L' is the maximum number of rows to be migrated.

• Current row is locked for other I/O operations during migration.

Referring to Figure 3, let $\{D_1, D_2, D_3, D_4\}$ be the 'M' disks present in RAID 0 and
$\{R_1, R_2, R_3, R_4, R_5\}$ be the 'L' rows that are to be migrated from RAID 0 to RAID 5. The
following steps are to be performed for each row in the migration process.
For each row $\{R_1, R_2, R_3, R_4, R_5\}$:

Step 1: READ the data row from 'M' drives present in RAID 0.

Step 2: WRITE the read data into the Temporary Migration Row, which includes a big
"seek" to the end of the disk. The "seek" is shown in Figure 3 as a solid line.

Step 3: Calculate the parity stripe position $P^R_{pos}$ and data stripe positions $Data^R_{pos}$ for N
disks of RAID 5. The formula to find the parity position is given in Section 2.2.

Step 4: Now calculate the parity $P^R$ using the read data from RAID 0. The formula to find
the parity is described below.

Step 5: Write the data stripes and parity in the corresponding row (say $R_1$ in our example)
in N disks of RAID 5 based on the positions calculated in Step 3. During this write a seek
operation is performed to the start of the row, which is shown in Figure 3 as a solid line.

Step 6: Update non-volatile area variable migratedRow with current row number. This is to
take care of the fault tolerance in case of power failures.

2.2. Formulae 

• Calculating Parity Position:

$P^R_{pos} = (R, C)$, where R is the row and C is the disk number.

'C' is defined as $C = N - (R \bmod N)$, where 'N' is the number of
disks present in RAID 5.

• Calculating Parity:

$P^R = \bigoplus_{i=0}^{pos-1} (R, i) \;\oplus\; \bigoplus_{j=pos+1}^{N} (R, j)$, where R is the row, pos is the parity
position, and 'N' is the number of disks present in RAID 5.

• Calculating Data Position (0 to M) for each row:

If $D < P^R_{pos}$ then $Data^R_{pos} = D$

Else $Data^R_{pos} = D + 1$

where D is 0 to M.
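The position formulae above can be sketched as follows. This is only an illustration under stated assumptions: rows and disks are numbered from 1 (so that C ranges over 1 to N), and the function names parity_disk and prior_art_data_disk are hypothetical.

```python
def parity_disk(row, n):
    """C = N - (R MOD N): disk number that holds the parity stripe of `row`."""
    return n - (row % n)

def prior_art_data_disk(d, parity_pos):
    """Data position rule: stripes at or beyond the parity slot shift by one disk."""
    return d if d < parity_pos else d + 1

n = 5  # N = M + 1 with M = 4 (assumed example size)
layout = {row: parity_disk(row, n) for row in range(1, 6)}
```

With these assumptions the parity slot rotates across the disks from row to row, and in every row the data stripes skip over it.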

2.3. Algorithmic Complexity 

For each row $\{R_1, R_2, R_3, R_4, R_5\}$ the read-write complexity can be calculated as follows:

In Figure 3 consider row $R_1$ in RAID 0. The prior art algorithm must perform
'M' reads to read M drives of RAID 0 and 'M' writes to the same drives in the Temporary
Migration Row to save the read data into a non-volatile area in order to take care of the
power failure case, along with a "big seek" in the drive spindle.

Calculate the parity and write back the data and parity in N drives. Again this
involves a big seek to the same row from which the data was read.

To migrate each row involves 'M' reads, '2M + 1' writes and M - 1 exclusive
OR (XOR) operations with two seeks in the drive.

As 'M' increases, the prior art migration process takes more processing time and
bandwidth for each row to migrate. If migration occurs during I/O, the performance
decreases.

3. Efficient RLM Algorithms 

Consider the same RAID 0 volume with M disks as in the prior art algorithm.
By adding one disk to the existing M disks of the RAID 0 volume, the RAID migration algorithm of the
present invention may reconstruct a RAID 5 volume of N disks, where N = M + 1. Figure 4 is a
block diagram illustrating a RAID migration algorithm in accordance with a preferred embodiment
of the present invention.

3.1. Migrating RAID 0 to RAID 5 

The following steps are performed to reconstruct a RAID 5 volume from a RAID 0 volume
using the new RLM algorithm.
Assumptions:

• Let 'M' be the maximum number of disks present in RAID 0.

• Number of disks that will be added to RAID 0 to migrate to RAID 5 is 1.

• 'N' is the number of disks present in RAID 5 (N = M + 1).

• Newly added disk number is X.

• Initialize the non-volatile area variables migratedRow and dataMigratedFlag to 0. The
non-volatile area may be an area in the disk itself or NVRAM.

• 'L' is the maximum number of rows to be migrated.

• Current row is locked for other I/O operations during migration.

Referring to Figure 4, let $\{D_1, D_2, D_3, D_4\}$ be the 'M' disks present in RAID 0 and
$\{R_1, R_2, R_3, R_4, R_5\}$ be the 'L' rows that are to be migrated from RAID 0 to RAID 5.

With reference now to Figure 5, a flowchart is shown illustrating the operation of RAID
migration in accordance with a preferred embodiment of the present invention. The process begins
and for each row, $\{R_1, R_2, R_3, R_4, R_5\}$ in the example shown in Figure 4 (step 502), the process
reads the data row from 'M' drives present in RAID 0 (step 504) and calculates the parity stripe
position $P^R_{pos}$ (step 506). The formula to find the parity position and data position is discussed
below. Then, the process calculates the parity $P^R$ using the read data from RAID 0 (step 508). The
formula to calculate the parity is given below.

Next, a determination is made as to whether the calculated parity stripe position $P^R_{pos}$ falls
in newly added disk X (step 510). If the calculated parity stripe position $P^R_{pos}$ falls in a disk other
than disk X, that is, a disk that is part of RAID 0, the process writes the data stripe read from the
position $P^R_{pos}$ into the newly added disk X of the corresponding row, $R_1$ in the depicted example
(step 512). Then the process sets the non-volatile variable flag dataMigratedFlag to TRUE to
indicate completion of data migration to the new volume (step 514). Following step 514 or a
determination that the parity position falls in the newly added disk in step 510, the process writes
the parity into the parity stripe position $P^R_{pos}$ (step 516). Thereafter, the process updates
non-volatile variable migratedRow with the current row number to take care of the fault tolerance
in case of power failures (step 518) and resets the non-volatile variable flag dataMigratedFlag to
FALSE (step 520). The process repeats until the last row is reached (step 522) and the process ends.
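The flow of Figure 5 described above can be sketched in a few lines. This is a minimal in-memory illustration, not the disclosed implementation: disks are modeled as Python lists of byte stripes, the new disk X is assumed to be the last disk, rows are numbered from 1, and the dictionary nv stands in for the non-volatile variables. The function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def migrate_raid0_to_raid5(disks):
    """Migrate in place; `disks` is a list of M per-disk row lists."""
    m = len(disks)
    n = m + 1
    rows = len(disks[0])
    disks.append([None] * rows)          # newly added disk X
    x = n - 1                            # 0-based index of disk X
    nv = {"migratedRow": 0, "dataMigratedFlag": False}
    for r in range(1, rows + 1):                      # step 502
        row = [disks[d][r - 1] for d in range(m)]     # step 504: M reads
        p = n - (r % n) - 1                           # step 506 (0-based index)
        parity = xor_blocks(row)                      # step 508
        if p != x:                                    # step 510
            disks[x][r - 1] = disks[p][r - 1]         # step 512
            nv["dataMigratedFlag"] = True             # step 514
        disks[p][r - 1] = parity                      # step 516
        nv["migratedRow"] = r                         # step 518
        nv["dataMigratedFlag"] = False                # step 520
    return nv
```

In this sketch each row touches at most two stripes: the parity slot and, when the parity slot displaces a data stripe, the corresponding slot on disk X.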

3.2. Formulae 

• Calculating Parity Position:

$P^R_{pos} = (R, C)$, where R is the row and C is the disk number.

'C' is defined as $C = N - (R \bmod N)$, where 'N' is the number of
disks present in RAID 5.

• Calculating Parity:

$P^R = \bigoplus_{i=0}^{pos-1} (R, i) \;\oplus\; \bigoplus_{j=pos+1}^{N} (R, j)$, where R is the row, pos is the parity
position, and 'N' is the number of disks present in RAID 5.

• Calculating Data Position (0 to M) for each row:

If $D \neq P^R_{pos}$ then $Data^R_{pos} = D$

Else $Data^R_{pos} = X$, where X is the newly added disk

where D is 0 to M.
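As a hedged illustration of the data position rule above, the sketch below lists where each data stripe of a row ends up. It assumes disks and rows are numbered from 1 and that the new disk X is disk number N; the names parity_disk and new_data_disk are hypothetical.

```python
def parity_disk(row, n):
    """C = N - (R MOD N): disk number that holds the parity stripe of `row`."""
    return n - (row % n)

def new_data_disk(d, parity_pos, x):
    """A stripe stays on disk d unless d is the parity slot; then it moves to disk x."""
    return d if d != parity_pos else x

n = 5        # N = M + 1 with M = 4 (assumed example size)
x = n        # assumption: the new disk X is the last disk
moves = {
    row: [new_data_disk(d, parity_disk(row, n), x) for d in range(1, n)]
    for row in range(1, 6)
}
```

Only the stripe that sat in the parity slot changes disks; every other stripe keeps its original position, which is why the rule needs no shifting of neighboring stripes.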
3.3. Algorithmic complexity

For each row $\{R_1, R_2, R_3, R_4, R_5\}$ the read-write complexity of the new RLM algorithm
can be calculated as follows:

In Figure 4 consider row $R_1$ in RAID 0. The algorithm of the present invention
performs 'M' reads to read M drives of RAID 0 and (M - 1) XOR
operations to calculate the parity.

The process of migration from RAID 0 to RAID 5, in accordance with the exemplary
aspects of the present invention, needs to update only the parity position and the newly
inserted drive data stripe. All other stripes remain the same. Therefore, only two write
operations are required per row in the algorithm of the present invention.

To migrate each row, the process of the present invention involves 'M' reads, one
write for the first row in each 'N' rows or two writes for the remaining M - 1 rows, and M - 1
XOR operations.

In the algorithm of the present invention, there is no need to make a copy of the
existing data to the same drives and, hence, the two time consuming "Big Seeks" per row are
avoided. Also, for any row there will be a maximum of two writes. Compared to the prior art
algorithm, the number of writes is drastically reduced.

As 'M' increases, the number of writes remains the same. The processing time and
bandwidth are drastically reduced when compared to the prior art algorithm.

If migration occurs during I/O, the performance is not affected when compared to the
prior art algorithm.
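The write counts discussed above can be compared with a small calculation. This is a sketch under assumptions: rows are numbered from 1, the new disk X is disk number N, and the function names are hypothetical. The per-row figures (2M + 1 writes for the prior art, one or two writes for the present algorithm) are taken from the text.

```python
def prior_art_writes(m, rows):
    """Prior art: 2M + 1 writes for every migrated row."""
    return rows * (2 * m + 1)

def new_writes(m, rows):
    """Present algorithm: one write when parity lands on the new disk, else two."""
    n = m + 1
    total = 0
    for row in range(1, rows + 1):
        parity_on_new_disk = (n - (row % n)) == n
        total += 1 if parity_on_new_disk else 2
    return total
```

For example, with M = 4 and five rows, prior_art_writes(4, 5) gives 45 writes while new_writes(4, 5) gives 9, and the per-row write count of the new algorithm does not grow with M.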

4. An Efficient Way of Handling Degraded RAID 5 

In RAID 0, I/O performance is always better than degraded RAID 5, because there are no
XOR operations. Therefore, it is better to reconstruct the degraded RAID 5 volume to a RAID 0
volume. Figures 6A and 6B are block diagrams illustrating reconstruction of a degraded RAID 5
volume to RAID 0 in accordance with a preferred embodiment of the present invention.
4.1. Algorithm for Reconstruction of Degraded RAID 5 to RAID 0

The old RAID 5 configuration must be retained if the failed drive of the degraded RAID
5 is other than the last drive of the logical array. Figure 6A shows a degraded RAID 5 with the last
drive failed. Referring to Figure 6A, $\{D_1, D_2, D_3, D_4\}$ are the online drives and $\{X\}$ is
the failed drive of the total 'N' disks present in the degraded RAID 5. The reconstruction
mechanism of the present invention converts the degraded RAID 5 to RAID 0 by replacing the
parity stripes of the degraded RAID 5 with the data stripes of the failed disk.

Figure 6B shows a degraded RAID 5 in which a drive other than the last drive has failed.
Referring to Figure 6B, $\{D_1, D_2, D_4, X\}$ are the online drives and $\{D_3\}$, represented as $\{Z\}$, is the
failed drive of the total 'N' disks present in the degraded RAID 5. The reconstruction mechanism
of the present invention converts the degraded RAID 5 to RAID 0 by replacing the parity stripes of
the degraded RAID 5 with the data stripes of the failed disk.

Assumptions:

• Let 'N' be the number of disks present in RAID 5.

• Let 'M' be the maximum number of disks present in RAID 0.

• Let the failed disk number be Z (refer to Figure 6B).

• Initialize the non-volatile area variables reconstructedRow and savedStripe to
0. The non-volatile area may be an area in the disk itself or NVRAM.

• TemporaryStripe is an extra stripe, having the same size as the RAID volume
stripe, at the end of the disk (Drive X) to store calculated failed drive data
before transferring to the actual data drive.

• Current row is locked for other I/O operations during reconstruction.

Figure 7 is a flowchart illustrating the operation of a process of reconstructing a degraded
RAID 5 volume to RAID 0 in accordance with a preferred embodiment of the present invention.
The process begins and, for each row $\{R_1, R_2, R_3, R_4, R_5\}$ (step 702), the process calculates the
parity stripe position $P^R_{pos}$ for N disks of RAID 5 (step 704). Then, a determination is made as to
whether $P^R_{pos}$ falls in failed drive Z (step 706).

If the parity position does not fall in the failed drive, the process reads the data row from M
drives present in the degraded RAID 5 (step 708), XORs the data stripes with the parity stripe to get
the failed drive data (step 710), and saves the failed drive data in the TemporaryStripe (step 712).
Following step 712 or a determination that the parity position falls in the failed drive in step 706,
the process updates non-volatile variable savedStripe with the current row number to take care of the
fault tolerance in case of power failures (step 714). Thereafter, the process writes the failed drive
data in the parity position $P^R_{pos}$ (step 716) and updates non-volatile variable reconstructedRow with
the current row number to take care of the fault tolerance in case of power failures (step 718). The
process repeats until the last row is reached (step 720) and the process ends.
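A minimal in-memory sketch of the flow of Figure 7 follows. It rests on several assumptions that are not in the text: disks are Python lists of byte stripes, the failed drive is modeled as None, rows are numbered from 1, and the dictionary nv stands in for the non-volatile variables and the TemporaryStripe. The sketch also assumes that no stripe is written when the parity position itself falls on the failed drive, since all data stripes of that row are then still online.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def degraded_raid5_to_raid0(disks, z):
    """Overwrite parity stripes with the failed drive's data, row by row.

    disks: list of N per-disk row lists; disks[z] is None (failed drive Z).
    """
    n = len(disks)
    online = [d for d in range(n) if d != z]
    rows = len(disks[online[0]])
    nv = {"savedStripe": 0, "reconstructedRow": 0, "TemporaryStripe": None}
    for r in range(1, rows + 1):                           # step 702
        p = n - (r % n) - 1                                # step 704 (0-based index)
        if p != z:                                         # step 706
            survivors = [disks[d][r - 1] for d in online]  # step 708
            nv["TemporaryStripe"] = xor_blocks(survivors)  # steps 710 and 712
        nv["savedStripe"] = r                              # step 714
        if p != z:
            disks[p][r - 1] = nv["TemporaryStripe"]        # step 716
        nv["reconstructedRow"] = r                         # step 718
    return nv
```

After the loop every online stripe holds data, so reads no longer require XOR reconstruction.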

RAID 0 in Figure 6B represents the reconstructed RAID 0 from degraded RAID 5 after
following the process illustrated in Figure 7. If power failure occurs after step 714, there will be a
difference between savedStripe and reconstructedRow variables. Therefore, the present invention
reads the data from the TemporaryStripe and performs the following steps 716 and 718 for that row
savedStripe. In all other cases, the process must start from step 712 for the row value
reconstructedRow + 1.

4.1.1. Stripe Mapping for Reconstructed RAID 0 

The reconstructed RAID 0 stripe mapping is slightly different from the normal RAID 0
mapping if the failed drive in the original RAID 5 configuration is other than the last drive of the
logical array. In that case, the mechanism of the present invention must remember the original
RAID 5 configuration. The reconstructed RAID 0 gives much better read/write performance than
degraded RAID 5. The complexity of degraded RAID 5 is reduced drastically in the case of
reconstructed RAID 0.

4.1.2. Reconstruction of Modified RAID 0 to RAID 5

The present invention also may reconstruct modified RAID 0 to RAID 5. If the failed
drive in the original RAID 5 configuration was the last drive, then reconstructed RAID 0 is the
same as original RAID 0. The RAID 5 reconstruction algorithm described above with reference
to Figure 5 may be used. Figure 8 is a block diagram illustrating an algorithm for reconstructing
modified RAID 0 to RAID 5 in accordance with a preferred embodiment of the present
invention.

If the failed drive was other than the last drive, then the algorithm below must be used to
reconstruct a RAID 5 volume from the modified RAID 0 volume.
Assumptions:

• Let 'M' be the maximum number of disks present in RAID 0.

• Number of disks that will be added to RAID 0 to migrate to RAID 5 is 1.

• 'N' is the number of disks present in RAID 5 (N = M + 1).

• Newly added disk number is X1.

• Assume the failed drive was D3. Map D3 to X1.

• Initialize the non-volatile area variables reconstructedRow and updatedDataRow to 0.
The non-volatile area may be an area in the disk itself or NVRAM.

• 'L' is the maximum number of rows to be migrated.

• Current row is locked for other I/O operations during reconstruction.

Referring to Figure 8, let $\{D_1, D_2, D_4, X\}$ be the 'M' disks present in RAID 0 and
$\{R_1, R_2, R_3, R_4, R_5\}$ be the 'L' rows that are to be migrated from RAID 0 to RAID 5. Figure 9 is a
flowchart illustrating the operation of a reconstruction algorithm in accordance with a preferred
embodiment of the present invention. The process begins and, for each row $\{R_1, R_2, R_3, R_4, R_5\}$
(step 902), the process reads the data row from 'M' drives present in RAID 0 (step 904) and
calculates the parity stripe position $P^R_{pos}$ (step 906). The formula to find the parity position and
data position is below.

Next, a determination is made as to whether the calculated parity stripe position $P^R_{pos}$ falls
in disk D3 (X1) (step 908). If the calculated parity position does not fall in the failed drive position,
the process writes the data read from $P^R_{pos}$ into disk D3 of the corresponding row, which is $R_1$ in
the example shown in Figure 8 (step 910). Then, the process sets the non-volatile variable
updatedDataRow to the current row value to indicate the completion of data transfer to disk
D3, to take care of the fault tolerance in case of power failures (step 912).

Following step 912 or a determination that the parity position falls in the failed drive
position in step 908, the process calculates the parity $P^R$ using the read data from RAID 0 (step
914). The formula to calculate the parity is given below. Then, the process writes the parity into
the parity stripe position $P^R_{pos}$ (step 916) and updates non-volatile variable reconstructedRow with
the current row number to take care of the fault tolerance in case of power failures (step 918).

If a power failure occurs and, when power comes back up, the updatedDataRow value is greater
than reconstructedRow, then data has been transferred successfully to drive D3. Therefore, the
process performs steps 904 and 906 and continues to 914 for the row value in updatedDataRow. For
all other cases, the process starts from step 904 for the row value 'reconstructedRow + 1'.
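The flow of Figure 9 can be sketched in the same in-memory style. The modeling choices are assumptions made for illustration: disks are Python lists of byte stripes, the slot of the failed drive D3 (index z) starts out as None and is populated as the replacement drive X1, rows are numbered from 1, and nv stands in for the non-volatile variables. Function names are hypothetical.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def modified_raid0_to_raid5(disks, z):
    """Rebuild RAID 5 in place; disks[z] is the slot of the replaced drive."""
    n = len(disks)
    online = [d for d in range(n) if d != z]
    rows = len(disks[online[0]])
    disks[z] = [None] * rows                          # replacement drive X1 mapped to D3
    nv = {"updatedDataRow": 0, "reconstructedRow": 0}
    for r in range(1, rows + 1):                      # step 902
        row = {d: disks[d][r - 1] for d in online}    # step 904: M reads
        p = n - (r % n) - 1                           # step 906 (0-based index)
        if p != z:                                    # step 908
            disks[z][r - 1] = row[p]                  # step 910
            nv["updatedDataRow"] = r                  # step 912
        parity = xor_blocks(list(row.values()))       # step 914
        disks[p][r - 1] = parity                      # step 916
        nv["reconstructedRow"] = r                    # step 918
    return nv
```

The sketch mirrors the RAID 0 to RAID 5 migration, except that the displaced data stripe returns to the position of the replaced drive rather than to a disk appended at the end.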

4.2 Accessing Degraded RAID 5 in the Prior Art 

Using the prior art algorithm, a read operation to the degraded RAID 5 is performed in one of two
ways. If the read operation is for an online drive at the data stripe position $Data_{pos}$, the
read may be issued directly to the stripe at that position, as shown in Figure 10A. If the read
operation is for a failed drive, such as $D_1$ shown in Figure 10B, the prior art algorithm must
calculate the parity position, read the data stripes and the parity stripe, and XOR the data
stripes with the parity stripe to recover the original data that was present in drive $D_1$.
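
The degraded read just described can be sketched as follows. This is a minimal Python
illustration of XOR-based recovery, not the prior art implementation itself; the names
xor_blocks and degraded_read, and the use of in-memory byte strings to stand in for stripes,
are assumptions made for the example.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def degraded_read(surviving_data: list[bytes], parity: bytes) -> bytes:
    """Recover the stripe of the failed drive for one row.

    Every surviving data stripe and the parity stripe of the row must be
    read, which is why the cost grows with the number of drives.
    """
    return xor_blocks(parity, *surviving_data)


# Hypothetical three-data-drive row: the parity is the XOR of all data stripes.
d1, d2, d3 = b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"
parity = xor_blocks(d1, d2, d3)
recovered = degraded_read([d2, d3], parity)  # d1 is on the failed drive
```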

When a write to the degraded RAID 5 is received, the prior art algorithm handles it according to
the drive the write targets. For any write, the algorithm first finds the data stripe position
$Data_{pos}$ to be written. One of the following three cases then applies.

CASE 1:

If the data stripe position $Data_{pos}$ falls in an online drive, say X, and the parity
position $P_{pos}$ also falls in one of the online drives, say $P^{R}_{data}$, as shown in
Figure 11, then do the following:

1. Read the data from $Data_{pos}$ and the parity from $P_{pos}$.

2. Remove the current data $Data_{data}$ from the parity $P^{R}_{data}$, using the formula
$P^{R}_{tempdata} = P^{R}_{data} \oplus Data_{data}$.

3. Update the parity with the new data that has to be written in $Data_{pos}$, using the
formula $P^{R}_{newdata} = P^{R}_{tempdata} \oplus Data_{newdata}$.

4. Write $Data_{newdata}$ in $Data_{pos}$ and $P^{R}_{newdata}$ in $P_{pos}$.

CASE 2:

If the stripe position $Data_{pos}$ falls in an online drive, say X, and the parity position
$P_{pos}$ falls in the failed drive, say $P^{R}_{data}$, as shown in Figure 12, then the prior
art algorithm directly issues a write to the stripe at data position $Data_{pos}$ on drive X.

CASE 3:

If the stripe position $Data_{pos}$ falls in the failed drive and the parity position $P_{pos}$
falls in an online drive, say $P^{R}_{data}$, as shown in Figure 13, then the prior art
algorithm performs the following:

1. Issue M reads to the M drives present in the RAID volume.

2. XOR the data stripes with the parity stripe to get the original data:
$Data_{data} = P^{R}_{data} \oplus (Data_{data1} \oplus Data_{data2} \oplus \ldots)$.

3. Remove the current data $Data_{data}$ from the parity:
$P^{R}_{tempdata} = P^{R}_{data} \oplus Data_{data}$.

4. Update the parity with the new data $Data_{newdata}$, which has to be written in stripe
position $Data_{pos}$: $P^{R}_{newdata} = P^{R}_{tempdata} \oplus Data_{newdata}$.

5. Write the parity information $P^{R}_{newdata}$ in $P_{pos}$.
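
The three prior art write cases above can be sketched together in Python. This is an
illustrative model only: a row is represented as a list of byte strings indexed by drive, with
None in the slot of the failed drive, and the names xor and prior_art_write and their
parameters are assumptions that do not appear in the original description.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized stripes."""
    return bytes(x ^ y for x, y in zip(a, b))


def prior_art_write(row: list, data_idx: int, parity_idx: int,
                    failed_idx: int, new_data: bytes) -> None:
    """Apply one write to a degraded RAID 5 row in place (hypothetical model)."""
    if data_idx != failed_idx and parity_idx != failed_idx:
        # CASE 1: data and parity are both online (read-modify-write).
        temp = xor(row[parity_idx], row[data_idx])   # remove the old data
        row[parity_idx] = xor(temp, new_data)        # fold in the new data
        row[data_idx] = new_data
    elif parity_idx == failed_idx:
        # CASE 2: the parity is lost, so only the data stripe is written.
        row[data_idx] = new_data
    else:
        # CASE 3: the target data stripe is on the failed drive.
        old_data = row[parity_idx]
        for i, stripe in enumerate(row):
            if i not in (parity_idx, failed_idx):
                old_data = xor(old_data, stripe)     # reconstruct the old data
        temp = xor(row[parity_idx], old_data)        # remove the old data
        row[parity_idx] = xor(temp, new_data)        # only the parity is written
```

In this model, Case 3 touches every surviving drive in the row before a single parity write can
be issued, which is the cost the next section sets out to avoid.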

4.3. Accessing Degraded RAID 5 Using the Method of the Present Invention

In the case of a degraded RAID 5, the algorithm of the present invention reconstructs RAID 0 from
the degraded RAID 5. Whether an access is a read or a write, the algorithm becomes much simpler
than the prior art algorithm. For any read or write that comes to an online drive, for example
$\{D_2, D_3, D_4, X\}$ shown in Figure 14A, the data stripe position $Data_{pos}$ falls in the
drives present in the reconstructed RAID 0. In this case, the algorithm of the present invention
first finds the corresponding data stripe position $Data_{pos}$ and then issues the read/write for
the stripe at data position $Data_{pos}$.

For any read or write that comes to the failed drive, for example $\{D_1\}$ shown in Figure 14B,
the algorithm of the present invention finds the parity position $P_{pos}$ for that row and issues
the read/write for the stripe at parity position $P_{pos}$. Therefore, fewer reads and writes are
needed than in the prior art, and no XOR operations are needed when accessing the degraded RAID 5.

Figure 15 is a flowchart illustrating the operation of a read/write access in accordance with
a preferred embodiment of the present invention. The process begins by receiving a read or write
access request for a drive. A determination is made as to whether the data position of the request
is the failed drive (step 1502). If the data position is not the failed drive, the process issues
the read/write for the stripe at the data position (step 1504) and ends. However, if the data
position is the failed drive in step 1502, the process finds the parity position for the row
(step 1506) and issues the read/write for the stripe at the parity position (step 1508).
Thereafter, the process ends.
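
The flow of Figure 15 reduces to a single redirect, which can be sketched as follows. The
function name target_drive and its integer drive indices are assumptions made for illustration;
how the parity position of a row is computed is taken as an input here rather than modelled.

```python
def target_drive(data_drive: int, failed_drive: int, parity_drive: int) -> int:
    """Return the drive that services a read/write on the reconstructed RAID 0.

    parity_drive is the parity position of the row being accessed; it is
    passed in because its calculation is outside the scope of this sketch.
    """
    if data_drive != failed_drive:
        # Step 1504: the stripe is on an online drive, so access it directly.
        return data_drive
    # Steps 1506 and 1508: the data of the failed drive now lives in the
    # stripe that used to hold the parity of this row.
    return parity_drive
```

Unlike the prior art paths, neither branch reads any other stripe of the row or performs an XOR.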



4.4. Handling Fault Tolerance 

Two types of faults need to be handled.

In case of power failure during the RLM process:

In the old algorithm,

The original data already stored in the Temporary Migration Row is used to
reconstruct the new RAID 5 volume. To find the row that has to be migrated, the
non-volatile variable migratedRow, which holds information about the row that has
already been migrated, is used.

In the new algorithm,

The NVRAM dataMigratedFlag indicates whether the data corresponding to the current
row (migratedRow + 1) has been migrated. If the flag is set, then only the parity
has to be calculated and updated. Otherwise, the flag implies that migration has
been completed through row 'migratedRow' and has to be restarted with the next
row.

In case of drive failure after RLM:

In the old algorithm,

For any read that falls on the data stripe of the failed drive, the entire row must
be read and all of the data stripes must be XORed with the parity stripe to get the
correct data.

In case of a write,

i) Get the old data by XORing the other data stripes
with the parity stripe.

ii) Remove the old data information from the parity data
(XOR the parity with the old data and update the parity).

iii) Calculate the parity (XOR the new data with the current parity)
and write it into the parity stripe.

In the new algorithm,

The new fault tolerance algorithm always reconstructs the degraded RAID 5
volume to RAID 0. Since RAID 0 is always faster than degraded RAID 5, I/Os
are much faster when compared to the prior art algorithm. Also, reconstructing
this RAID 0 to RAID 5 is done much more efficiently, as the algorithm always
knows the previous RAID 5 configuration.
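
The power-failure rule of the new algorithm described earlier in this section can be sketched as
follows. The function name migration_resume and the string labels it returns are hypothetical;
only migratedRow and dataMigratedFlag come from the description.

```python
def migration_resume(migrated_row: int, data_migrated_flag: bool) -> tuple[int, str]:
    """Decide how RAID level migration resumes after a power failure.

    Models the non-volatile values migratedRow and dataMigratedFlag.
    Returns the row to work on and which part of the work is still pending.
    """
    current_row = migrated_row + 1
    if data_migrated_flag:
        # The data of the current row was already moved before the power
        # failure, so only the parity must be calculated and written.
        return current_row, "parity_only"
    # Migration is complete through migratedRow; redo the next row in full.
    return current_row, "full_row"
```

No copy of the row has to be kept on disk for this decision, which is how the Temporary
Migration Row of the old algorithm is avoided.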

5. Complexity Comparison 

To migrate RAID 0 to RAID 5, the prior art algorithm requires, for each row, M reads (M is the
number of disks present in the RAID 0) and 2M + 1 writes. The prior art algorithm also involves a
'big seek' in the drives (the drive spindle moving from the current row to the Temporary Migration
Row and back to save the data), which is a time-consuming operation.

The algorithm of the present invention needs only M reads and a maximum of two writes for each
row. It does not require any big seek, since there is no need to save the data in the Temporary
Migration Row. The algorithm therefore saves at least 2M - 1 writes and the 'big seek' for each
row. The present invention also reduces the processing time and bandwidth. The performance
increases drastically when I/O and migration are occurring simultaneously. In addition, in case of
any drive failure, the algorithm of the present invention performs more efficiently, as it
reconstructs the degraded RAID 5 to RAID 0. Since RAID 0 always performs much better than degraded
RAID 5, the algorithm of the present invention is efficient even in the case of drive failure.
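
The per-row counts quoted above can be tabulated with a short Python sketch. The function name
per_row_io and the dictionary layout are assumptions made for illustration; the figures are the
ones stated in the text (M reads and 2M + 1 writes for the prior art, M reads and at most two
writes for the present invention).

```python
def per_row_io(m: int) -> dict:
    """Per-row I/O counts for migrating RAID 0 with m disks to RAID 5."""
    prior_writes = 2 * m + 1
    new_writes_max = 2
    return {
        "prior_art": {"reads": m, "writes": prior_writes},
        "present_invention": {"reads": m, "max_writes": new_writes_max},
        # Minimum saving per row, which equals 2M - 1.
        "writes_saved": prior_writes - new_writes_max,
    }


counts = per_row_io(4)
```

With M = 4, the prior art issues 9 writes per row against at most 2, a saving of 7 writes per
row, and the saving grows linearly with the number of disks.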

6. Advantages

The RAID Level Migration process becomes very fast because of the reduced write operations (i.e.,
always a maximum of two write operations, irrespective of the number of drives present in the RAID
volume). As the number of disks in the RAID 0 increases, the performance gain during the migration
process also increases. In case of a drive failure, the algorithm of the present invention
reconstructs the entire RAID 0 volume from the degraded RAID 5. The complexity of the
reconstructed RAID 0 is reduced compared to the degraded RAID 5. Hence, the reconstructed RAID 0
gives much better performance than the degraded RAID 5. Power failure during the RAID volume
migration process is handled very efficiently by simply using a non-volatile variable instead of
a Temporary Migration Row, as in the prior art. By not using the Temporary Migration Row, the
present invention saves M writes per row, where M is the number of disks present in the RAID 0.
The algorithm of the present invention also performs better by avoiding the 'big seek' during the
migration process, because the present invention does not use the Temporary Migration Row.

7. Conclusion

Over the years, RAID technology has gathered momentum and has become the de facto storage
paradigm for servers. With the advancement of technology, storage administrators are
implementing redundant storage pools with bigger, faster, and more numerous drives. To keep the
cost per MB of storage capacity low, there is a big push towards RAID 5 instead of RAID 1 or
RAID 10. This works very well for most server applications, though there is some write penalty
that has to be paid for random writes. This penalty, however, is quantified and controlled
irrespective of the size of the RAID array. In the worst case, in read-modify-write updates, one
block write may result in at most two reads and two writes. The problem, however, occurs in the
event of a drive dropping out. In this situation, if the array is large, then even for reads that
map into the failed drive, the algorithm must read all the other (N-1) drives and do a large XOR
to re-compute the lost data. Also, for writes, more I/Os may be required. The present invention
dramatically reduces this overhead by allowing a non-redundant (degraded) RAID 5 to be migrated
quickly to RAID 0 and then, when a new drive becomes available, converted back into a fully
redundant RAID 5. This new striping for RAID 5 retains all the merits of striping data and
rotating parity for performance and redundancy, while still allowing quick migration between
redundant and non-redundant RAID levels.



